Method and apparatus for accelerating preliminary operations for cryptographic processing

ABSTRACT

A method and apparatus for cryptographic data processing includes determining a first modulus having up to a first number of binary digits. A large integer is received which has up to a second number of binary digits that is greater than the first number of binary digits. The first modulus and the large integer are sent to a first processor for computing a first residue of the large integer modulo the first modulus. Before the first processor finishes computing the first residue, the first modulus is also sent to a second processor for computing a second residue of two raised to a power of twice the first number of binary digits modulo the first modulus. The first residue and the second residue are used as input to a third processor that computes a cryptographic result based on the large integer.

FIELD OF THE INVENTION

The present invention generally relates to cryptographic data processing. The invention relates more specifically to a method and apparatus for accelerating computation of Montgomery multiplication constants or modular reduction, or both, which are preliminary operations for widely used decryption methods.

BACKGROUND OF THE INVENTION

The security of many cryptographic algorithms lies in the mathematical difficulty in factoring large integer values (whole numbers with hundreds of decimal digits or more). Factoring a particular integer means determining the unique set of prime numbers that, multiplied together, form the particular integer. A prime number is a number that has as factors only the number itself and the number one.

Many cryptographic algorithms also employ modulo arithmetic in which intermediate and final results are expressed as an integer in the range from 0 to m−1 for a number m called a modulus. The modular reduction operation is here represented by the term “mod.” The modular reduction operation has two parameters, the modulus m and an integer a, and one result, the integer b such that a=b+k*m for some integer k. Effectively, the output b of the modular reduction operation is the remainder, or residue, of dividing the input integer a by the modulus m. If a is non-negative and less than m, then b is the same as a. The modular reduction operation is herein expressed as “a modulo m equals b” and written as

a mod m=b

Alternatively, this is expressed as “a is equivalent to b modulo m” and written as

a≡b [mod m]

where [mod m] in square brackets indicates the immediately preceding number or variable is the output of the modulo operation. That is, the integer b always lies between 0 and m−1, whereas the integer a need not. The integer b is the residue of a modular reduction operation on the integer a and the modulus m. Other modular arithmetic operations commonly employed in cryptographic processing include modular addition (the modular reduction of a sum of two integers), modular subtraction (the modular reduction of a difference between two integers), modular multiplication (the modular reduction of a product of two integers), modular division (the modular reduction of a quotient of a first integer divided by a second integer) and modular exponentiation (the modular reduction of a first integer raised to the power of a second integer).
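The modular operations listed above can be illustrated with a short Python sketch. The function name mod_reduce and the sample values are illustrative assumptions and do not appear in the specification. Modular division is modeled here as multiplication by a modular inverse, which is one common interpretation and requires the divisor to be coprime to the modulus; the three-argument pow with a negative exponent assumes Python 3.8 or later.

```python
def mod_reduce(a: int, m: int) -> int:
    """Return the residue b with 0 <= b < m such that a = b + k*m."""
    return a % m  # for m > 0, Python's % always yields a value in 0..m-1

m = 7
a = 23
b = mod_reduce(a, m)      # residue of a modulo m
k = (a - b) // m          # the integer k in a = b + k*m

mod_add = (5 + 4) % m     # modular addition
mod_sub = (3 - 5) % m     # modular subtraction (result stays in 0..m-1)
mod_mul = (5 * 4) % m     # modular multiplication
# Modular division, interpreted as multiplying by the inverse of 4 modulo m
# (an assumption of this sketch; valid only when gcd(4, m) == 1).
mod_div = (5 * pow(4, -1, m)) % m
mod_exp = pow(3, 4, m)    # modular exponentiation

assert a == b + k * m
assert (mod_div * 4) % m == 5  # the quotient times the divisor recovers 5
```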

Modular multiplication and exponentiation are often performed based on Montgomery's algorithm, well known in the art, and described in the article “Modular Multiplication without Trial Division,” by P. L. Montgomery, in Mathematics of Computation, v 44, n. 170, 1985, pp. 519–521.

Cryptographic processing systems can be implemented in software, but speed is often significantly increased by implementing some of the steps in special purpose hardware such as electronic circuits. Such hardware typically takes the form of an application specific integrated circuit (ASIC), a “chip,” which is composed of separate blocks of circuitry that each perform a certain combination of one or more steps of the computation. The blocks of circuitry are connected so that the output of one block is fed as input to another block. At many steps, a set of parallel connections between blocks is devoted to passing every binary digit (bit) of input and output during each processing cycle. Efficient, thoroughly tested, small footprint blocks have been developed for several modulo computations. Common circuit blocks employed in cryptographic processing systems include modular reduction (MR) blocks, modular addition (MA) blocks, modular subtraction (MS) blocks, modular multiplication (MM) blocks, modular division (MD) blocks and modular exponentiation (ME) blocks.

In designing and building circuits to perform cryptographic processing one often has to trade the size of the circuitry for latency. The size of the circuitry is often measured in number of fundamental components called gates. The latency is often measured in the number of processing cycles. A gate transforms an input set of one or more bits to an output set of one or more bits during each processing cycle. Chips with fewer gates that are reused in subsequent processing cycles require more processing cycles to complete processing and increase latency. Chips with more gates that can complete processing in fewer processing cycles are larger, cost more and consume more power than chips with fewer gates. As a consequence, there are many alternatives for the architecture of the individual blocks and the arrangement of multiple blocks in processing systems.

The number of gates on a block is also related to the maximum number of bits of the input to and output from the block during one processing cycle; the more bits the more gates. The blocks are usually designed for integers up to a certain maximum number of bits. For example, existing MR blocks use precision division or successive subtractions for a limited number of bits, typically 128 bits or fewer. The use of precision division or successive subtraction becomes unwieldy at larger input and modulus sizes, such as at 1024 bits and 2048 bits. The number of processing cycles used for successive subtractions increases with the difference between the number of bits for the large integer and the number of bits for the modulus. This difference can sometimes be quite large, on the order of 1000 bits.
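The dependence of the cycle count on the bit-length difference can be seen in a small software model. The following Python sketch is one possible reading of "successive subtractions" (a shift-and-conditionally-subtract loop, one step per bit of difference); it is an assumption for illustration and is not a description of any particular existing MR block. The function name and the sample operands are hypothetical.

```python
def reduce_by_shift_subtract(a: int, m: int) -> tuple:
    """Reduce a modulo m with one conditional subtraction per aligned shift.

    Returns (residue, steps), where steps counts loop iterations and is
    used here as a stand-in for processing cycles.
    """
    shift = max(a.bit_length() - m.bit_length(), 0)
    steps = 0
    # Start with m aligned to the top bit of a, then move down one bit at a time.
    for s in range(shift, -1, -1):
        t = m << s
        if a >= t:      # subtract only when the difference stays non-negative
            a -= t
        steps += 1
    return a, steps

# Hypothetical operands: a 2048-bit integer and a 1024-bit modulus.
big = (1 << 2047) + 12345
mod = (1 << 1023) + 1155
residue, steps = reduce_by_shift_subtract(big, mod)
assert residue == big % mod
```

In this model the number of steps is the bit-length difference plus one, so a difference on the order of 1000 bits costs on the order of 1000 steps.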

For some cryptographic processing, the modular reduction is performed a few times on a very large integer with a number of bits much greater than existing MR blocks accept, and more frequently on integers having a number of bits less than the maximum for existing MR blocks. An example cryptographic algorithm widely deployed is RSA, invented by Rivest, Shamir and Adleman, and described in the reference Applied Cryptography: Protocols, Algorithms, and Source Code in C, by Bruce Schneier, 1996, John Wiley & Sons, New York (hereinafter referenced as Schneier). In this algorithm, the Chinese Remainder Theorem, well known in the art, is employed to break down a larger problem with a large modulus M, where M is equal to the product of two primes P1 and P2, into two smaller problems with the smaller moduli P1 and P2. The residue of large text T modulo P1, and the residue of T modulo P2, are needed (where T is the cipher text during decryption).

In current implementations, the smaller residues, e.g., T mod P1 and T mod P2, are used in subsequent processing steps that employ hardware designed to handle integers of the size of the residues, e.g., of the sizes of P1 or P2, but not of the size of the large integer, e.g., the size of T (also the size of M). Therefore the residues of the large texts are often computed in software and then passed as input to the hardware to continue the processing. The software computation of the residue is a performance hindrance.

Based on the foregoing, there is a clear need for an MR block that provides a smaller residue of a very large integer, which is not too costly in chip size and latency.

Furthermore, Montgomery multiplication modulo modulus m involves a factor called a Montgomery Constant that depends on m. In a past approach, the Montgomery Constant is computed in software for each modulus involved in the cryptographic processing and stored in one or more registers on the cryptographic processing chip. In the RSA algorithm, three moduli (M, P1, P2) are used for each private-key-public-key pair, so that three Montgomery Constants have to be determined for the three moduli and stored in three registers on the chip, consuming valuable chip area to support a large number of key pairs. Assuming use of 4,000 key pairs, which is reasonable for a practical implementation, the memory required to store the three Montgomery Constants (for M, P1 and P2) is approximately 12 megabits, excluding other pre-calculated constants.

Other cryptographic processing algorithms that compute Montgomery Constants include Diffie-Hellman key generation and the Digital Signature Algorithm (DSA), both well known in the art and described in Schneier. To support multiple key pairs, multiple sets of three registers can sometimes be involved, consuming even more valuable area on the chip. For example, in the Ephemeral Diffie-Hellman key pair generation algorithm, well known in the art, the moduli can possibly change for each secret key generation. In this algorithm, the constants cannot be pre-computed at all, but are necessarily computed after initiation of each exchange sequence.

Based on the foregoing there is a clear need for computing Montgomery Constants as needed for Montgomery multiplication in MM and ME blocks, so that the number of registers on the chips to store Montgomery Constants can be reduced without excessively increasing latency.

Based on the foregoing, there is also a clear need for a cryptographic processing system that both computes Montgomery Constants as needed and provides hardware components for modular reduction of very large integers without excessively increasing latency.

The past approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

SUMMARY OF THE INVENTION

The foregoing needs, and other needs and objects that will become apparent from the following description, are achieved in the present invention, which comprises, in one aspect, an apparatus for generating a digital output signal representing a modular reduction of a large integer. A first input receives a first input signal that represents a modulus having up to a first number of binary digits. A second input receives a second input signal that represents the large integer having up to a second number of binary digits that is greater than the first number of binary digits. A third input receives a third input signal that represents a constant based on a reciprocal of the modulus. A circuit is configured for generating an output signal representing a residue of the large integer modulo the modulus. The output signal is based on the first input signal and the second input signal and the third input signal. The circuit does not perform a division by the modulus, and does not consume a number of processing cycles as great as the first number of binary digits. An output presents the output signal that represents the residue.

According to another aspect of the invention, an apparatus for generating a digital output signal representing a residue of a particular power of two includes an input that receives input data that represents a modulus having up to a number of binary digits. A circuit is configured for determining the residue of two raised to a power of twice the number of binary digits modulo the modulus. An output presents the digital output signal representing the residue of two raised to the power of twice the number of binary digits. This signal represents the Montgomery Constant for the modulus.

According to another aspect of the invention, a method for generating a digital output signal representing a residue of a particular power of two includes receiving input data that represents a modulus having up to a number of binary digits. A first data element is initialized with data that represents two raised to a power of the number of binary digits. A difference is obtained by subtracting the modulus from a value represented by data in the first data element. It is determined whether the difference is negative. If it is determined that the difference is not negative, then data that represents the difference shifted toward more significant digits by one binary digit is placed into the first data element. Based on the data in the first data element, a digital output signal is provided that represents the residue of two raised to a power of twice the number of binary digits modulo the modulus. This output is the Montgomery Constant for the modulus.
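The subtract-and-shift method summarized above can be modeled in software. The following Python sketch is a minimal illustration under stated assumptions and is not the claimed circuit: the summary states only the branch taken when the difference is not negative, so the other branch (shifting the unreduced value), the iteration count of K, and the final conditional subtraction are all assumptions of this sketch. The function name montgomery_constant is hypothetical, and the sketch is restricted to odd moduli of at least 3.

```python
def montgomery_constant(m: int) -> int:
    """Return 2^(2K) mod m using only subtraction, comparison, and 1-bit shifts.

    K is the number of binary digits of m, so 2^(K-1) <= m < 2^K.
    """
    if m < 3 or m % 2 == 0:
        raise ValueError("this sketch assumes an odd modulus of at least 3")
    k = m.bit_length()
    r = 1 << k                      # first data element initialized to 2^K
    for _ in range(k):              # assumed: one pass per bit of the modulus
        d = r - m                   # difference of the data element and m
        # If the difference is not negative, keep it; otherwise (assumed
        # branch) keep the unreduced value. Either way shift left by one bit.
        r = (d if d >= 0 else r) << 1
    d = r - m                       # assumed final correction step
    return d if d >= 0 else r

# Cross-check against direct modular exponentiation for a few odd moduli.
for modulus in (3, 5, 97, 3233, (1 << 61) - 1):
    assert montgomery_constant(modulus) == pow(2, 2 * modulus.bit_length(), modulus)
```

In this model the data element stays below twice the modulus at every step, so a single subtraction per pass is enough and the loop runs K times.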

According to another aspect, a method for cryptographic data processing includes determining a first modulus having up to a first number of binary digits. A large integer is received which has up to a second number of binary digits that is greater than the first number of binary digits. The first modulus and the large integer are sent to a first processor for computing a first residue of the large integer modulo the first modulus. Before the first processor finishes computing the first residue, the first modulus is sent to a second processor for computing a second residue of two raised to a power of twice the first number of binary digits modulo the first modulus. The second residue is the Montgomery Constant for the first modulus. The first residue and the second residue are used as input to a third processor that computes a cryptographic result based on the large integer.

In other aspects, the invention encompasses an apparatus and a computer readable medium configured to carry out the steps of the foregoing methods.

The apparatuses and methods of these aspects allow both Montgomery Constants and modular reduction of very large integers to be implemented in hardware that is operated in parallel, significantly decreasing the latency of cryptographic processing.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIG. 1 is a block diagram that illustrates an overview of a cryptographic processing system, according to an embodiment;

FIG. 2 is a block diagram that illustrates a modular arithmetic block for the cryptographic processing system of FIG. 1, according to an embodiment;

FIG. 3 is a flowchart that illustrates a high level overview of one embodiment of a method for using the modular arithmetic block of FIG. 2;

FIG. 4 is a flowchart that illustrates one embodiment of a method to compute a Montgomery Constant for cryptographic processing;

FIG. 5A is a flowchart that illustrates one embodiment of a method to perform modular reduction without precision division or excessive subtractions;

FIG. 5B is a flowchart that illustrates one embodiment of a method to perform a step of the method of FIG. 5A;

FIG. 6 is a block diagram of a MR block to perform modular reduction without precision division or excessive subtractions, according to an embodiment; and

FIG. 7 is a block diagram that illustrates a computer system upon which an embodiment may be implemented.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

A method and apparatus for accelerating preliminary operations for cryptographic processing is described. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

Embodiments are described herein in sections according to the following outline:

1.0 OPERATIONAL CONTEXT
2.0 STRUCTURAL OVERVIEW
3.0 FUNCTIONAL OVERVIEW
4.0 MONTGOMERY CONSTANT COMPUTATION
5.0 MODULAR REDUCTION COMPUTATION
6.0 MODULAR REDUCTION BLOCK
7.0 HARDWARE OVERVIEW
8.0 EXTENSIONS AND ALTERNATIVES

1.0 Operational Context

To illustrate the modular arithmetic methods and apparatus, it is assumed that an electronic integrated circuit is fabricated for performing modular arithmetic operations to support RSA private key decryption. However, embodiments of the invention are not limited to this context, but may be employed in other contexts as well, such as public key-private key exchange, encryption, and decryption, and digital signatures. For example, embodiments may be used in processing systems, like RSA decryption, that employ either the Chinese Remainder Theorem (CRT) or modular exponentiation or modular multiplication or any combination of these. For example, embodiments may be employed as a means of designing such circuitry, as a software means for generating a Montgomery constant, as a hardware or software means of exchanging keys using Diffie-Hellman and ephemeral Diffie-Hellman, of verifying Digital Signature Algorithm (DSA) signatures, and of verifying RSA signatures.

The Digital Signature Algorithm (DSA) is a well-known digital signature algorithm promulgated by the National Institute of Standards and Technology (NIST). Diffie-Hellman is a well-known public-key private key exchange protocol. Both DSA and Diffie-Hellman are described in Schneier, referenced above. The DSA is used as the basis of the government Digital Signature Standard (DSS). Use of DSA is required in many popular network security protocols such as Secure Sockets Layer (SSL) and Internet security protocol (IPSec).

Both RSA and DSA employ public key cryptography techniques based on two keys known as a public key and a private key. The two keys are mathematically related, but the private key cannot be determined from the public key. In a system implementing public key technology, each party has its own public/private key pair. The public key can be known by anyone; however, no one should be able to modify it. The private key is kept secret. Its use should be controlled by its owner and it should be protected against modification as well as disclosure.

In general, in public key cryptography, a sender uses the recipient's public key to encrypt a plain text message; the resulting encrypted message is known as cipher text. The plain text may comprise data for text, voice, images, video, or any other data. The cipher text is sent to the recipient. The recipient can decrypt the message by providing the recipient's private key to a decryption algorithm that processes the message. Because deriving either party's private key from either party's public key is mathematically impractical, a malicious party cannot practically decrypt the message.

RSA decryption makes use of the numeric integer parameters E, D, P1, P2 and M. E is the public key published by the recipient of a message for use in encrypting plain text (X) to generate cipher text (C) to be sent to the recipient. D is the private key used by the recipient to decrypt the cipher text C and regenerate the plain text X. The parameters P1 and P2 are prime numbers whose product M is a modulus used on the cipher text C and plain text X. That is, according to RSA encryption/decryption

M=P1*P2  (1a)
C=X^(E) mod M  (1b)
X=C^(D) mod M  (1c)

Let K1 be the number of bits in P1 and K2 be the number of bits of P2. It should be noted that integers X, C and M each involve a number of bits that is about the sum of K1 plus K2.

To improve performance, the Chinese Remainder Theorem (CRT) is employed to take advantage of the fact that M is the product of two primes. For decryption, the CRT solution takes the form of evaluating the following expressions:

D1=D mod (P1−1)  (2a)
D2=D mod (P2−1)  (2b)
F1=P2^(P1−1) mod M  (2c)
F2=P1^(P2−1) mod M  (2d)
C1=C mod P1  (3a)
C2=C mod P2  (3b)
X1=C1^(D1) mod P1  (4a)
X2=C2^(D2) mod P2  (4b)
X=[(X1*F1 mod M)+(X2*F2 mod M)] mod M  (5)

It is noted that expressions 2a, 2b, 2c, 2d depend only on the key parameters D, P1 and P2, and not on the cipher text, and therefore can be evaluated before the cipher text C is generated or received.
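Expressions 1a through 5 can be checked numerically with small values. The following Python sketch is illustrative only: the primes 61 and 53, the exponent 17 and the message 65 are toy values chosen as assumptions for the example and are far too small for real use, and the function name rsa_crt_decrypt is hypothetical. The private exponent is derived with the three-argument pow (Python 3.8 or later).

```python
def rsa_crt_decrypt(c: int, d: int, p1: int, p2: int) -> int:
    """Recover X from C using expressions 2a-5 (CRT form of RSA decryption)."""
    m = p1 * p2                        # (1a)
    # Pre-computable values: they depend only on D, P1 and P2.
    d1 = d % (p1 - 1)                  # (2a)
    d2 = d % (p2 - 1)                  # (2b)
    f1 = pow(p2, p1 - 1, m)            # (2c): 1 modulo P1, 0 modulo P2
    f2 = pow(p1, p2 - 1, m)            # (2d): 0 modulo P1, 1 modulo P2
    # Large-integer reductions of the cipher text.
    c1 = c % p1                        # (3a)
    c2 = c % p2                        # (3b)
    # Half-size modular exponentiations.
    x1 = pow(c1, d1, p1)               # (4a)
    x2 = pow(c2, d2, p2)               # (4b)
    # Recombination.
    return ((x1 * f1) % m + (x2 * f2) % m) % m   # (5)

# Toy parameters (assumptions for illustration only).
P1, P2, E = 61, 53, 17
M = P1 * P2
D = pow(E, -1, (P1 - 1) * (P2 - 1))    # private exponent matching E
X = 65
C = pow(X, E, M)                       # (1b)

assert rsa_crt_decrypt(C, D, P1, P2) == X
assert rsa_crt_decrypt(C, D, P1, P2) == pow(C, D, M)   # agrees with (1c)
```

The two exponentiations in 4a and 4b operate on operands about half the size of M, which is the source of the performance gain; the reductions in 3a and 3b are the only steps that must accept the full-size cipher text.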

Steps 3a and 3b involve the modular reduction of a large integer C that has many more bits than either modulus P1 or P2. For example, when P1 and P2 each include 1024 bits, integer C would have up to 2047 or 2048 bits, about twice the number of bits in either P1 or P2. Blocks devoted to modular reduction usually depend on large precision division or multiple subtractions. Sometimes, rather than devote chip real estate to a simple MR block, blocks for other operations are reused for some cycles to obtain modular reduction residues of input integers. For example, an MM block is used to determine the modular product of one and the input integer. However, none of these conventional approaches has provided MR blocks that accept very large input integers, such as input integers with more than 1024 bits.

Steps 4a and 4b involve modular exponentiation that may be performed using modular exponentiation (ME) blocks that employ Montgomery multiplication. In addition, step 5 involves two modular multiplies that may also be performed using ME blocks that employ Montgomery multiplication.

Montgomery multiplication for a modulus m involves a Montgomery Constant (MCm) that depends on the number of bits (K) in the modulus m. Specifically, two variables, K and R, are defined by the following two expressions

2^(K−1) ≤ m < 2^(K),  (6a)
R=2^(K).  (6b)

A Montgomery multiplication sub-block (MMS) performs the following operation on two operands A1 and A2.

MMS(A1, A2)=A1*A2*R^(−1) mod m.  (6c)

The MMS can be used to determine the product B of two operands, A1, A2 as follows:

B′=MMS(A1, R^(2)),  (7a)
B=MMS(B′, A2).  (7b)

The term R^(2) used in equation 7a depends only on the modulus m and, reduced modulo m, is called the Montgomery Constant for modulus m (MCm). That is,

MCm=R^(2) mod m=(2^(K))^(2) mod m=2^(2K) mod m.  (7c)
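Equations 6a through 7c can be verified with a reference model. The following Python sketch evaluates the MMS operation directly from its definition in equation 6c using a modular inverse of R; it is a functional model for checking the algebra and is an assumption of this example, not the division-free procedure a hardware MMS would use. The names mms and montgomery_product and the sample operands are hypothetical. The modulus must be odd so that R is invertible modulo m.

```python
def mms(a1: int, a2: int, m: int, k: int) -> int:
    """Reference model of equation 6c: A1*A2*R^(-1) mod m with R = 2^K."""
    r = 1 << k                         # (6b)
    r_inv = pow(r, -1, m)              # exists because m is odd
    return (a1 * a2 * r_inv) % m

def montgomery_product(a1: int, a2: int, m: int) -> int:
    """Compute A1*A2 mod m with two MMS calls, as in equations 7a and 7b."""
    k = m.bit_length()                 # K satisfies 2^(K-1) <= m < 2^K  (6a)
    mc = pow(2, 2 * k, m)              # Montgomery Constant MCm  (7c)
    b_prime = mms(a1, mc, m, k)        # (7a): equals A1*R mod m
    return mms(b_prime, a2, m, k)      # (7b): the factor R cancels

# Sample check with a small odd modulus (values chosen for illustration).
assert montgomery_product(45, 76, 97) == (45 * 76) % 97
```

The first call multiplies A1 by R^(2) and divides by R, leaving A1*R; the second call multiplies by A2 and divides by R again, leaving the ordinary modular product. This is why only MCm, and no other modulus-dependent constant, is needed for equations 7a and 7b.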

The conventional approaches to providing the Montgomery Constant compute the constant in software for multiple moduli involved in multiplication operations and store the results in registers available to the Montgomery Multiplication sub-block (MMS). In some implementations, the register size devoted to storing the Montgomery Constants can grow large and consume valuable space on integrated circuits.

2.0 Structural Overview

FIG. 1 is a block diagram that illustrates an overview of a cryptographic processing system, according to an embodiment. The system includes an encryption/decryption integrated circuit, in which an embodiment is implemented. A client device 110 on a trusted local network 150 is connected to a non-secure, public network 155 through a gateway device 130. Client device 110 may be a network infrastructure element such as a router, switch, etc., that executes an SSL agent or IPSec process, for example. Alternatively, client device 110 may be a software process of an end station device such as a personal computer, workstation, server, etc. The gateway device 130 may be a computer or a network device such as a router. To encrypt and decrypt text, an encryptor/decryptor ASIC 131 is included in gateway device 130. Elsewhere connected to the network 155, a second client device 112 is connected through a second local network 152 and a second gateway 132 with a second encryptor/decryptor ASIC 133.

A first user of a process on client device 110 sends an electronic plain text message X to gateway 130 for encryption. A user, in this context, may be a programmatic process or software agent, as well as a human user. The message X may be a flow of data packets, an electronic document, or any other associated electronic data. Based on the address of the client device 110, or some other means of identifying the first user, a process on the gateway invokes the ASIC 131 for encrypting the message X with the shared parameters for the encryption algorithm along with the public key for the recipient at client device 112. Cipher text (e.g., the integer C) is sent over the public network for client device 112.

The information for the client device 112 is received at gateway device 132, which invokes the ASIC 133 for decrypting the cipher text into plain text. The gateway device 132 passes the cipher text to the ASIC 133. If the ASIC 133 is able to decrypt the cipher text (e.g., when the plain text X is generated, or when a digital signature is verified) then the message X is sent to a process on client device 112 over local network 152.

FIG. 2 is a block diagram that illustrates a modular arithmetic block 200 for the cryptographic processing ASIC, e.g., 131, 133 of FIG. 1, according to an embodiment. The modular arithmetic block 200 includes a modular exponentiation farm (ME farm) 250 of multiple ME blocks 250 a, 250 b, 250 c, 250 d among others, represented by ellipsis 251. In the illustrated embodiment, the ME farm 250 includes 16 ME blocks. Each ME block is able to perform modular exponentiation, modular multiplication, or modular reduction for up to three 1024-bit inputs (a modulus and one or two operands) based on control signals at one or more control inputs. The arithmetic block also includes an arithmetic controller 252 that determines which bits form which operands on which ME block and that provides the control signals for selecting exponentiation, multiplication or reduction. For example, under control of arithmetic controller 252, the ME farm 250 performs the operations indicated by Equations 4a, 4b, described above for RSA decryption.

The modular arithmetic block 200 also includes modular arithmetic post processor blocks 262 that include one or more blocks to perform special modular operations, such as a 2048-bit modular exponentiation block and a 128-bit modular addition block, and that include a memory to store parameters for particular processes. The arithmetic controller 252 determines which bits form which operands on which blocks and provides control signals for the blocks in the modular arithmetic post processor blocks 262. For example, under control of arithmetic controller 252, the modular arithmetic post processor blocks 262 perform the operations indicated by Equation 5, described above for RSA decryption. The output from the modular arithmetic post processor blocks 262 is presented as output 268 from the modular arithmetic block 200. In the illustrated embodiment, the output 268 is presented in a 1024-bit buffer.

According to the illustrated embodiment, the modular arithmetic block 200 also includes a parameter collector block 230 that receives data indicating the parameters for the cryptographic process. The input to the modular arithmetic block 200 is provided as input 202 to the parameter collector block 230. In the illustrated embodiment, the input 202 is a 1024-bit buffer. For example, the parameter collector block 230 receives the moduli P1, P2, M, receives the pre-computed parameters D1, D2, F1, F2 computed using equations 2a, 2b, 2c, 2d, and receives ciphertext C, described above for RSA decryption, all in a series of 1024-bit signals through input 202. In some embodiments, the parameter collector block 230 also receives parameters such as MU1 and MU2 which are determined by P1 and P2, respectively, and which are described in more detail below.

A data bus 204 carries data from the parameter collector block 230 to the ME farm 250 and the modular arithmetic post processor blocks 262. The data bus 204 includes channels 204 a that go directly to the modular arithmetic post processor blocks 262 as well as channels 204 b that go into the ME farm 250 and channels 204 c that come out of the ME farm 250. In an illustrated embodiment, the data bus 204 includes 2048 channels to transfer 2048 bits in each processing cycle, including 128 bits to each of the 16 ME blocks 250 a, 250 b, 250 c, 250 d, 251 in ME farm 250. In some embodiments, the data bus 204 includes fewer channels and transfers data using additional processing cycles. In some other embodiments, the data bus 204 includes more channels and transfers data in fewer processing cycles. The bits received at parameter collector block 230 are directed to the ME farm 250 or the modular arithmetic post processor blocks 262 or to other blocks, described below, under the control of the arithmetic controller 252.

According to the illustrated embodiment, the modular arithmetic block 200 also includes a large input modular reduction block 210, a small Montgomery Constant block 220 a and a large Montgomery Constant block 220 b. In the illustrated embodiment, the parameter collector block 230 communicates two ways with each of these three blocks using a 128-bit data bus represented by the double-headed solid arrows in FIG. 2.

The large input modular reduction block 210 performs the computations of Equations 3a, 3b, described above for RSA decryption. More details on the large input modular reduction block 210 are described below with reference to FIG. 5A, FIG. 5B and FIG. 6. KM represents the number of bits to hold modulus M; K1 represents the number of bits to hold modulus P1, and K2 represents the number of bits to hold modulus P2. According to the illustrated embodiment, the large input reduction block 210 determines the residue of the text C for both modulus P1 and P2, combined, in a number of processing cycles N210 that is about 75% of the sum of the number of bits K1 and K2 in the two moduli P1 and P2. That is,

N210≈0.75*(K1+K2)≈0.75*KM  (8a)

The small Montgomery constant block 220 a performs the computations of Equation 7c for modulus m=P1 or m=P2, described above for RSA decryption. The large Montgomery constant block 220 b performs the computations of Equation 7c for modulus m=M=P1*P2, described above for RSA decryption. More details on the Montgomery Constant blocks 220 a, 220 b are described below with reference to FIG. 4. Km represents the number of bits to hold modulus m. According to the illustrated embodiment, the Montgomery constant blocks 220 a, 220 b determine the Montgomery Constant for modulus m in a number of cycles N220 that is about the number of bits in the modulus m. That is,

N220≈Km  (8b)

Because the Montgomery Constant for modulus M is computed in hardware instead of software, the computation is faster, with less latency, than computing the Montgomery Constant for modulus M in software.

As shown in FIG. 2, the blocks 210, 220 a, 220 b are connected in parallel to the parameter collector block 230. By connecting these blocks in parallel, the Montgomery Constant for both small moduli, P1 and P2, can be computed by block 220 a, and the modular reduction of C for both small moduli can be computed by block 210, while the Montgomery Constant for the large modulus M is computed by block 220 b. Thus the modular reductions of C and the Montgomery Constants, for both small moduli, can be computed with no increase in latency while the Montgomery Constant for the large modulus M is computed.
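The latency benefit of the parallel arrangement follows from equations 8a and 8b. The following Python sketch is simple cycle arithmetic under stated assumptions: KM is taken to be exactly K1+K2, block 220 a is assumed to process P1 and then P2 in sequence, transfer overhead is ignored, and the "serial" figure is a hypothetical comparison in which one unit performs all three tasks back to back. The function name preliminary_cycles is hypothetical.

```python
def preliminary_cycles(k1: int, k2: int) -> dict:
    """Estimate cycle counts for the three parallel blocks of FIG. 2."""
    km = k1 + k2                       # assumed: KM is the sum of K1 and K2
    n210 = 3 * (k1 + k2) // 4          # block 210, equation 8a
    n220a = k1 + k2                    # block 220 a: P1 then P2, equation 8b each
    n220b = km                         # block 220 b: modulus M, equation 8b
    return {
        "block_210": n210,
        "block_220a": n220a,
        "block_220b": n220b,
        "parallel": max(n210, n220a, n220b),   # blocks run side by side
        "serial": n210 + n220a + n220b,        # hypothetical one-at-a-time case
    }

# Illustrated sizes: two 1024-bit primes.
estimate = preliminary_cycles(1024, 1024)
assert estimate["parallel"] == estimate["block_220b"]
```

Under these assumptions the small-modulus work on blocks 210 and 220 a finishes within the time block 220 b needs for the large modulus, so the parallel latency equals the latency of block 220 b alone.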

3.0 Functional Overview

FIG. 3 is a flowchart that illustrates a high level overview of embodiment 300 of a method for using the modular arithmetic block 200 of FIG. 2. Although steps are depicted in a particular order in FIG. 3 and subsequent flowcharts, in other embodiments the steps can be performed in a different order or overlapping in time. For example, step 320 may be performed before step 310 or overlapping in time with step 310.

In step 310, parameters are received for a cryptographic process. For example, the parameters P1, P2, M, D1, D2, F1, F2 for RSA decryption are received through input 202 at block 230. For purposes of illustration, it is assumed that M, F1, F2 each involve 2048 bits and that the other parameters each involve 1024 bits or fewer.

In step 320, a large set of text T is received to transform using cryptographic processing. For example, the cipher text C is received at the parameter collector block 230 to be transformed to plain text X during RSA decryption. In other embodiments, other text is received, such as plain text X to be transformed to cipher text C during RSA encryption, or cipher text representing a digital signature is received. For purposes of illustration, it is assumed that the large set of text includes 2048 bits.

In step 322, the modular reduction of the text modulo a first modulus of the small moduli is performed to produce the first text residue. For example, the collector block 230 sends the text C and the moduli P1, P2 to the large input modular reduction block 210. The large input modular reduction block 210 computes a residue C1 by performing the modular reduction of the cipher text C modulo the modulus P1 in 0.75*K1 processing cycles. In the illustrated example, the parameter collection block 230 receives the value of the residue C1.

In step 324, the modular reduction of the text modulo a second modulus of the small moduli is performed to produce the second text residue. For example, the block 210 computes a residue C2 by performing modular reduction of the cipher text C modulo the modulus P2 in 0.75*K2 processing cycles. In the illustrated example, the parameter collection block 230 receives the value of the residue C2. If other moduli are involved, such as in algorithms using more than two prime factors, the modular reductions of the text modulo the additional moduli are also evaluated. According to the RSA decryption process, there are no other prime factors of M.

In step 340, the Montgomery Constant for the first modulus of the small moduli is computed. For example, the collector block 230 sends the modulus P1 to the small Montgomery Constant block 220 a. The small Montgomery Constant block 220 a computes the Montgomery Constant MCP1 for modulus P1 in K1 processing cycles. In the illustrated example, the parameter collection block 230 receives the value of MCP1.

In step 342, the Montgomery Constant for the second modulus of the small moduli is computed. For example, after K1 processing cycles, the collector block 230 sends the modulus P2 to the small Montgomery Constant block 220 a. The small Montgomery Constant block 220 a computes the Montgomery Constant MCP2 for modulus P2 in K2 additional processing cycles. In the illustrated example, the parameter collection block 230 receives the value of MCP2. If other moduli are involved, such as in algorithms using more than two prime factors, the Montgomery Constants of the additional moduli are also evaluated. According to the RSA decryption process, there are no other prime factors of M.

In step 360, the Montgomery Constant for the large modulus is computed. For example, the collector block 230 sends the modulus M to the large Montgomery Constant block 220 b. The large Montgomery Constant block 220 b computes the Montgomery Constant MCM for modulus M in KM processing cycles. In the illustrated example, the parameter collection block 230 receives the value of MCM.

Steps 320, 340, 360 are illustrated as starting at the same time. In other embodiments, one or more may start later than others. For example, because it is estimated that step 360 takes more processing cycles to complete than steps 320, 322, 324, step 360 is started first in some embodiments. To take advantage of the parallel connections between the collector 230 and each of the blocks 210, 220 a, 220 b, some embodiments start each of steps 320, 340, 360 before any of steps 324, 342, 360 completes.

In step 380, the text residues and Montgomery Constants are used to continue processing according to the cryptographic algorithms being employed. For example, for RSA decryption, the text residues C1, C2 are used according to Equations 4a, 4b to evaluate X1 and X2 by employing two ME blocks of the ME farm 250 and the Montgomery Constants MCP1, MCP2. Then the results X1, X2 and parameters F1, F2 are used according to Equation 5 to produce plain text X by employing a large, 2048-bit exponentiation using the large Montgomery Constant MCM and a large, 2048-bit exponentiation block in the post processing blocks 262.

Using the steps of method 300, the residues C1, C2 of C and the Montgomery Constants MCP1, MCP2 for both small moduli can be computed with little or no increase in latency while the Montgomery Constant MCM for the large modulus M is computed.
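The latency claim can be checked with simple arithmetic. The following Python sketch is illustrative only: it uses the cycle counts stated above (0.75*K cycles per residue on block 210, K cycles per small Montgomery Constant on block 220 a, KM cycles on block 220 b), assumes K1 = K2 = 1024 and KM = 2048, and ignores data transfer and control overhead. The function name `preliminary_latency` is hypothetical.

```python
def preliminary_latency(k1, k2, km):
    """Return (parallel, serial) cycle counts for the preliminary steps.

    Assumption: each block runs its own steps back to back, and the
    three blocks start together, as in the illustrated embodiment.
    """
    reduction_block = 0.75 * k1 + 0.75 * k2  # steps 322 and 324 on block 210
    small_mc_block = k1 + k2                 # steps 340 and 342 on block 220 a
    large_mc_block = km                      # step 360 on block 220 b
    parallel = max(reduction_block, small_mc_block, large_mc_block)
    serial = reduction_block + small_mc_block + large_mc_block
    return parallel, serial


parallel, serial = preliminary_latency(1024, 1024, 2048)
print(parallel, serial)
```

Under these assumed sizes the parallel arrangement is bounded by the 2048 cycles of the large Montgomery Constant, whereas running the same steps one after another would take 5632 cycles.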

4.0 Montgomery Constant Computation

FIG. 4 is a flowchart that illustrates one embodiment 400 of a method to compute a Montgomery Constant for cryptographic processing. In various embodiments, the method may be implemented in hardware or in software or both. In the illustrated embodiment, the method is implemented in each of two hardware blocks: a first block for relatively small moduli, e.g., 1024 bits and less; and a second block for relatively large moduli, e.g., 1025 to 2048 bits. In other embodiments the boundary between large and small moduli may be different. In some embodiments more than two blocks associated with more than two ranges of moduli sizes may be employed. The method yields a Montgomery Constant for a modulus m having K bits in a number of cycles N=K.

In step 410, a modulus m having up to K bits is received. This can be accomplished in one or more processing cycles depending on the number of channels in the data bus and the size of the modulus m. For example, with a 128-channel data bus capable of transferring 128 bits in one cycle, a modulus of 1024 bits can be received in 8 cycles. The size K can be deduced from the modulus m using any method known in the art. One approach is described below with reference to Equation 9a.

In step 420, a variable Z is set to a value of two raised to the power of K. In hardware this is done by storing a value of 1 in the (K+1)th bit of a register, as counted from the least significant bit. The register is herein called the “Z register” and is big enough to handle the largest modulus for the block. For example, in a small Montgomery Constant block designed for a modulus m up to 1024 bits in size, the Z register includes 1025 bits. In a large Montgomery Constant block designed for a modulus up to 2048 bits in size, the Z register includes 2049 bits. In one embodiment, the (K+1)th bit is efficiently set to 1 with limited chip area and limited latency by inputting the value of 1 to a bank of shifters. The bank of shifters includes a combination of 256-bit shifters, 64-bit shifters, 16-bit shifters, 4-bit shifters and 1-bit shifters.

Steps 424, 430, 432 or 434, and 440 form a loop that is traversed K times. Any manner of forming the loop in hardware or software may be used.

In step 424, the difference is determined between Z and the modulus m by subtracting m from Z. In step 430, it is determined whether the difference is negative. If the difference is negative, control passes to step 432; if not, control passes to step 434. The first difference will not be negative, so control will first pass to step 434.

In step 434, the difference is shifted left one bit, effectively doubling the difference, and the shifted difference is stored in a memory location such as variable Z in memory or in a special Z register. Control passes to step 440 to determine whether to traverse the loop again.

In step 432, which is reached when the difference is negative, the contents of the Z variable (or Z register) are shifted left one bit, effectively doubling the value of Z, and the shifted result is stored in the variable Z (or Z register). Control passes to step 440 to determine whether to traverse the loop again.

Step 440 represents a decision point for traversing the loop again. For example, if the difference has not been computed K times, then control passes back to step 424 to traverse the loop again. If the difference has been computed K times, the loop ends and control passes to step 450. Therefore the loop consumes K processing cycles, where K is the number of bits in modulus m.

In step 450, the Montgomery Constant MCm for modulus m is set to the value of Z. As defined in Equation 7c, the Montgomery Constant for modulus m is 2^(2K) mod m. For example, the value of the Z register is placed in a buffer that can be read by the parameter collector block 230, or an “is valid” flag is set to indicate that the value in the Z register is the final value after the loop.
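One reading of steps 410 through 450 is sketched below in Python, whose integers stand in for the Z register. The function name `montgomery_constant` is hypothetical. After K traversals the value left by the loop is congruent to 2^(2K) modulo m but can still be as large as 2*m, so the sketch ends with a short subtraction loop that brings the result into the range 0 to m−1; that final reduction is an assumption added here and is not a step described for FIG. 4.

```python
def montgomery_constant(m):
    """Return 2^(2K) mod m, where K is the bit length of m.

    Only subtraction, a sign test and a one-bit left shift are used
    inside the loop, mirroring steps 424 through 440.
    """
    if m < 1:
        raise ValueError("modulus must be a positive integer")
    k = m.bit_length()        # size K of the modulus (compare Equation 9a)
    z = 1 << k                # step 420: Z = 2^K
    for _ in range(k):        # loop traversed K times (step 440)
        diff = z - m          # step 424
        if diff < 0:          # step 430
            z = z << 1        # step 432: double Z
        else:
            z = diff << 1     # step 434: double the difference
    # Assumption added for this sketch: reduce the final value into [0, m).
    while z >= m:
        z -= m
    return z


print(montgomery_constant(13))
```

For m = 13 (K = 4) the loop leaves Z = 22, and the final subtraction yields 9, which equals 2^8 mod 13.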

Using the method 400 of FIG. 4, the Montgomery Constant can be determined in hardware or software. If performed on dedicated circuit blocks or on dedicated general purpose processors, all the Montgomery Constants associated with a cryptographic process can be determined in parallel, without increasing latency over the number of cycles inherent in the computation of the Montgomery Constant for the largest modulus.

5.0 Modular Reduction Computation

FIG. 5A is a flowchart that illustrates one embodiment of a method to perform modular reduction without precision division or excessive subtractions. Such a method is desirable over precision division or multiple subtractions in many cases. In general, if the difference in bit sizes between the cipher text and modulus is greater than 16 bits, repeated subtraction is not used. For example, for a ciphertext having 1030 bits and a modulus having 1024 bits, repeated subtraction may be used; in contrast, for a ciphertext of 32 bits and a modulus of 12 bits, repeated subtraction is not used. If the ciphertext size is greater than 64 bits, then precision division is not used. Thus, for a ciphertext of 1030 bits and a modulus of 1024 bits, precision division is not used. For a ciphertext of 32 bits and a modulus of 12 bits, precision division can be used.
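The two size thresholds in the preceding paragraph can be restated as a small Python sketch. The function name `reduction_options` and the dictionary keys are hypothetical; the thresholds of 16 bits and 64 bits are the ones given above.

```python
def reduction_options(text_bits, modulus_bits):
    """Report which conventional reduction methods the stated rules allow."""
    return {
        # Not used when the size difference exceeds 16 bits.
        "repeated_subtraction": (text_bits - modulus_bits) <= 16,
        # Not used when the text is larger than 64 bits.
        "precision_division": text_bits <= 64,
    }


print(reduction_options(1030, 1024))
print(reduction_options(32, 12))
```

When neither conventional method is allowed, or when a uniform hardware path is preferred, the method of FIG. 5A applies.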

FIG. 5A is based on Barrett's algorithm, described in the reference “Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor,” P. Barrett, in Advances in Cryptology—CRYPTO '86 Proceedings, Springer-Verlag, 1987, pp. 311–323 (hereinafter Barrett). The reference does not suggest optimizing the algorithm for implementation in hardware rather than on a general-purpose processor.

According to Barrett, text T has less than 2*K bits where K is the number of bits in the modulus P. Given P, K can be computed according to Equation 9a.

K = [log₂ P] + 1  (9a)

where log₂ represents the logarithm operation to the base 2 on the following operand, and the square brackets denote rounding down to an integer. A factor MU depends on the reciprocal of P according to Equation 9b.

MU = [2^(2K)] div P  (9b)

where div represents an integer result from a division by the following operand. MU is independent of the text T being operated on; so MU can be predetermined and stored when P is defined, and used for several sets of text T using the same public and private keys without further divisions. A first quantity, Q, is defined by Equation 9c.

Q = ([T div 2^(K−1)] * MU) div 2^(K+1)  (9c)

A second quantity, S, is defined by Equation 9d and the steps listed as 9e and 9f.

S = (T mod 2^(K+1)) − ([Q*P] mod 2^(K+1))  (9d)

If (S<0) then reset S to S + 2^(K+1)  (9e)

while (S>P) reset S to S − P  (9f)

Resetting S in step 9e amounts to changing the sign bit of a signed integer. When S is no longer reset, S contains the residue of the text T modulo the modulus P.
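Equations 9a through 9f can be exercised with the following Python sketch. The function name `barrett_reduce` is hypothetical, and MU is recomputed on every call for brevity rather than stored. The final loop tests S ≥ P rather than S > P so that a text that is an exact multiple of P reduces to 0; that comparison is a choice made for this sketch. The example values satisfy 3*P ≤ 2^(K+1), a range in which the (K+1)-bit truncation of Equation 9d is easy to justify; other inputs are not examined here.

```python
def barrett_reduce(t, p):
    """Reduce t modulo p following Equations 9a through 9f."""
    k = p.bit_length()                      # Equation 9a
    if t < 0 or t >> (2 * k):
        raise ValueError("text must be non-negative and shorter than 2K+1 bits")
    mu = (1 << (2 * k)) // p                # Equation 9b (the only true division)
    q = ((t >> (k - 1)) * mu) >> (k + 1)    # Equation 9c
    mask = (1 << (k + 1)) - 1               # keeps the K+1 least significant bits
    s = (t & mask) - ((q * p) & mask)       # Equation 9d
    if s < 0:
        s += 1 << (k + 1)                   # step 9e
    while s >= p:                           # step 9f (>= is this sketch's choice)
        s -= p
    return s


print(barrett_reduce(200, 9))
```

For T = 200 and P = 9 (K = 4), MU is 28 and Q is 21; Equation 9d gives 8 − 29 = −21, step 9e raises this to 11, and one subtraction of P leaves the residue 2.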

Embodiments utilize the method in FIG. 5A implemented in software on a general-purpose processor or in hardware. The implementation is used as the large input modular reduction block 210.

In step 510, a modulus P having up to K binary digits is received. For example, modulus P1 is received by the large input modular reduction block 210. If K is not provided as input, K is determined based on P and Equation 9a.

In step 512, a value for MU is determined as defined in Equation 9b. In some hardware implementations, MU is pre-computed in software or in a different hardware block and passed to the modular reduction block 210 and stored there for all computations involving the same keys. For example, in RSA decryption embodiments, values of MU for both P1 and P2 are received and stored in memory on the modular reduction block 210.

In step 514, a value for the text T, having fewer than 2*K bits, is received. For example, 2048 bits of the cipher text C are received.

In step 516, a first temporary variable called the TA variable (or a temporary register called the TA register) is set to the K+1 most significant bits (MSB) of T. This is equivalent to a divide by 2^(K−1), a power of two. A second temporary variable called the TB variable (or a temporary register called the TB register) is set to the K+1 least significant bits (LSB) of T. This is equivalent to modular reduction by 2^(K+1), a power of two. In hardware implementations, MSB and LSB selections, integer division by a power of two, and modular reduction by a power of two are readily accomplished with small chip area and few processing cycles using shifters such as the bank of shifters described above with reference to step 420 of FIG. 4. The TB variable (or TB register) includes the value for the first term in Equation 9d.
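The equivalence between bit selection and arithmetic by a power of two in step 516 can be seen in a short Python sketch. The function name `split_text` is hypothetical, and the 8-bit text with K = 4 is chosen only to keep the numbers readable.

```python
def split_text(t, k):
    """Return (TA, TB) as selected in step 516 for a text of up to 2K bits."""
    ta = t >> (k - 1)                # drop K-1 low bits: same as T div 2^(K-1)
    tb = t & ((1 << (k + 1)) - 1)    # keep K+1 low bits: same as T mod 2^(K+1)
    return ta, tb


t, k = 0b11010011, 4
ta, tb = split_text(t, k)
print(bin(ta), bin(tb))
```

Here TA is the five most significant bits 11010 and TB is the five least significant bits 10011; neither requires a divider.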

In step 518, the contents of the TA variable (or the TA register) are reset to the product of the former contents and MU. In step 520, the contents of the TA variable (or the TA register) are reset to the K+1 MSB of that product, which is the quantity Q. This is equivalent to a divide by 2^(K+1). Steps 516, 518, 520 yield the quantity Q according to Equation 9c.

In step 530, a third temporary variable called the TC variable (or a temporary register called the TC register) is set to the K+1 LSB of the product of Q and P, as in the second term of Equation 9d. In one embodiment, a large multiplier is used to perform the multiply, but only the K+1 LSB are stored in the TC register. This embodiment allows all the steps of method 500 to be completed in a number of processing cycles that is about 0.75*K. More details on how to perform step 530 in an alternative hardware embodiment are described below with reference to FIG. 5B.

In step 570, the residue variable (or the register called the residue register), represented by the symbol CP, is set to the difference obtained by subtracting the second term of Equation 9d from the first term of Equation 9d; the first and second terms are stored in the TB and TC variables (or TB and TC registers), respectively. This step completes the evaluation of Equation 9d.

In step 580, a test is performed to determine whether the contents of the residue variable (or the residue register) represent a negative number. If the contents are not negative, control passes to step 584. If the contents are negative, control passes to step 582 to reset the contents of the residue variable (or the residue register) to a positive number by negating the contents. Control passes to step 584.

In step 584, it is determined whether the contents of the residue variable (or the residue register) represent a number greater than the modulus P. If so, control passes to step 588 to reset the contents of the residue variable (or the residue register) to the difference obtained by subtracting the modulus P from the contents of the residue variable (or residue register). Because of the value selected for MU in step 512, step 588 is expected to be performed no more than two times.

A residue computed only with subtractions would be expected to involve about 2^(KM−KP) subtractions, where KM is the number of bits in the large modulus M and KP is the number of bits in the smaller modulus P. Therefore an excessive number of subtractions, and the excessive latency caused by the excessive subtractions, are avoided using MU in step 512.
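The cost of reducing by subtraction alone can be illustrated with a small Python sketch. The function name `reduce_by_subtraction` and the example sizes (a 20-bit text and a 12-bit modulus, chosen to keep the loop short) are assumptions for illustration.

```python
def reduce_by_subtraction(t, p):
    """Return (residue, number of subtractions) using subtraction only."""
    count = 0
    while t >= p:
        t -= p
        count += 1
    return t, count


text = (1 << 20) - 1    # a 20-bit value
modulus = 4093          # a 12-bit value
residue, count = reduce_by_subtraction(text, modulus)
print(residue, count)
```

With a size difference of 8 bits the loop runs 256 times, that is 2^8; with the 1024-bit difference between M and P1 in the RSA example, the same approach is clearly impractical.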

In step 586, the value of the residue variable (or the residue register) is output, in any manner known in the art. For example, the contents are moved to an output buffer. In some embodiments, CP is already in an output buffer, and a valid bit is set in the output buffer during step 586 to indicate that the contents of the output buffer are valid for reading.

Steps 510 to 570 are repeated for P=P2 having up to K2 bits. In hardware implementations, this is accomplished by using the same hardware components in later processing cycles with different inputs.

Using the steps of method 500, Barrett's algorithm can be efficiently implemented in hardware at a relatively low cost in terms of chip area (e.g., few temporary registers) and latency.

FIG. 5B is a flowchart that illustrates an alternative embodiment 530 a of a method to perform step 530 of the method 500 of FIG. 5A. In step 530 a, the TC register is set to the K+1 LSB of the product of Q and P. The value of Q is stored in the TA register. Since Q is not used after step 530, the value in the TA register can be modified during the operation.

In step 532, the TC register and a counter J are initialized with all zeros. The counter J is used to track which bits of P have been multiplied by Q.

In step 534, a group size G is determined, which indicates how many bits of P are multiplied by Q during each processing cycle. There is a trade-off between the size of the sub-block devoted to computing the modular product and the number of processing cycles consumed to yield the product. To save size, G is chosen to be much smaller than K. In some embodiments, G=1. In hardware implementations, step 534 is performed once, at design time when the sub-block to perform the multiplication is designed and fabricated.

Steps 538, 540, 542, 544, 546, 548 form a loop that is traversed enough times to multiply every bit in P by Q. Only the bits of Q and P that contribute to the K+1 LSB of the product are kept. When G=1, the loop is traversed K times and consumes K cycles. When G>1, the loop is traversed fewer than K times and consumes fewer cycles. Any manner of forming the loop in hardware or software may be used.

In step 538, the value for the counter J during the current traversal of the loop is determined. J starts at zero and is incremented by G during each traversal. The loop is not traversed if J is greater than K. When K+1 divided by G is not an integer, the last bits of P are multiplied by Q using special logic, easily determined by one of ordinary skill.

In step 540, the values of Q*L are determined for the 2^(G) values of L from 0 through 2^(G)−1. When G=1, the two values of the product are 0 and Q. When G>1, the values of the product are 0, Q, . . . (2^(G)−1)*Q. The values are stored in an array of registers or on chip memory. The values are readily determined in one processing cycle by banks of shifters and adders. For example, if G=3, then the array has elements from 0 through 2³−1, which is 7; i.e., the array has 8 elements from 0 to 7. At each position in the array is a value of a multiple of Q from 0 to 7*Q. In hardware, completely filling this small array can be performed consuming less chip area and fewer processing cycles than are consumed by inserting a high precision multiplication block to form the one product needed.

In step 542, the bits of P to be multiplied by Q are determined and stored in the variable called “FACT” herein. For example, from most significant to least significant bits, FACT is set to the bits in positions J+G−1 to J of the modulus P. When G=1, FACT is set to the bit in the J position of the modulus P. It is assumed for purposes of illustration that G=3, J=6 and the 3 bits in the 8th, 7th and 6th positions of P are “011” which is “3” in decimal notation.

In step 544, the TC register is reset to the contents in the TC register added to the value in the array associated with the position given by the bits in the FACT variable. For example, the bits “011” in the FACT variable indicate array element 3, and the value in element 3 of the array is 3*Q. This value 3*Q is then added to the value already in the TC register.

In step 546, Q is left shifted by G bits, which is equivalent to multiplying Q by 2^(G). This step assures that the products computed in the next traversal of the loop are added to the correct bit positions in TC. To achieve a correct result with such shifting, the memory location that holds Q, such as the TA register, should have at least K+1 bits.

Step 548 represents a decision point for traversing the loop again. If the loop is traversed again, because after incrementing J by G, J is still no greater than K, then control returns to step 538. If, after incrementing, J is greater than K, control passes to step 550.

In step 550, the remaining bits of P, if any, are multiplied by Q and the product is added to the TC register.

Using the steps of method 530 a, the product P*Q mod 2^(K+1) can be efficiently implemented in hardware at a cost in terms of chip area and latency that depends on the choice of G.
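The loop of FIG. 5B can be sketched in Python as follows. The function name `low_bits_product` is hypothetical. Python multiplication stands in for the banks of shifters and adders that fill the array in step 540, and no separate step 550 is needed because bit positions beyond the top of P simply read as zero.

```python
def low_bits_product(q, p, k, g=1):
    """Return the K+1 least significant bits of Q*P, taking G bits of P per pass."""
    mask = (1 << (k + 1)) - 1
    tc = 0                                  # step 532: clear TC
    j = 0                                   # step 532: clear counter J
    q &= mask                               # the register holding Q has K+1 bits
    while j <= k:                           # steps 538 and 548
        # Step 540: array of the multiples 0*Q .. (2^G - 1)*Q of the shifted Q.
        multiples = [(l * q) & mask for l in range(1 << g)]
        fact = (p >> j) & ((1 << g) - 1)    # step 542: bits J+G-1 .. J of P
        tc = (tc + multiples[fact]) & mask  # step 544: accumulate into TC
        q = (q << g) & mask                 # step 546: align Q for the next group
        j += g
    return tc


print(low_bits_product(21, 9, 4, g=3))
```

For Q = 21, P = 9 and K = 4 the result is 29, the five least significant bits of 189. A larger G shortens the loop at the price of a larger array of multiples.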

6.0 Modular Reduction Block

FIG. 6 is a block diagram of a MR block 210 a to perform modular reduction without precision division or excessive subtractions, according to an embodiment. The MR block 210 a implements the steps of method 500 for RSA decryption.

The MR block 210 a includes two smaller registers 614 a, 614 b (“P registers”) for storing data representing the two smaller prime moduli of RSA decryption, P1 and P2, respectively. In the illustrated embodiment the registers 614 a, 614 b hold 1024 bits to accommodate moduli up to that size. In other embodiments, other boundaries between small and large moduli may be selected. For example, widely used modulus sizes may be included in a small register, while larger but more rarely used modulus sizes may be included in a large register. In some embodiments, the moduli sizes may be divided into more than two ranges.

The MR block 210 a also includes two registers 612 a, 612 b (“MU registers”) for storing data representing MU1 and MU2, as computed using Equations 9a and 9b for the two smaller prime moduli, P1 and P2, respectively. In the illustrated embodiment, the registers 612 a, 612 b hold 1025 bits to accommodate MU up to that size. In some embodiments, the values of MU1 and MU2 may be computed in hardware sub-blocks (not shown) based on the values of P1 and P2.

The MR block 210 a also includes one register 610 (a “T register”) for storing data representing the large input text T, such as cipher text C or plain text X. In the illustrated embodiment, the register 610 holds 2048 bits to accommodate values of C or X up to that size. The register 610 is connected to a binary divide sub-block 632 and a binary mod sub-block 634. The binary divide sub-block 632 outputs the MSB of the value in the T register 610, up to 1023 bits. In the illustrated embodiment, this output is the initial value of TA computed during step 516, as indicated by the arrow 633 in FIG. 6. The value may be stored in a TA register (not shown) within the data selection control block 630. The binary mod sub-block 634 outputs the LSB of the value in the T register 610, up to 1025 bits. In the illustrated embodiment, this output is the value of TB computed during step 516, as indicated by the arrow 635 in FIG. 6.

Using a 128-bit data bus, it takes 50 processing cycles to load the registers 610, 612 a, 612 b, 614 a, and 614 b.
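The figure of 50 cycles can be reproduced with the following Python sketch. It assumes that each register is loaded separately and that a partly filled final bus transfer still costs a full cycle; the function name `load_cycles` is hypothetical.

```python
def load_cycles(register_bits, bus_bits=128):
    """Total bus cycles to load registers of the given widths, one at a time."""
    # -(-b // bus_bits) is ceiling division using integers only.
    return sum(-(-b // bus_bits) for b in register_bits)


# T register 610, MU registers 612 a and 612 b, P registers 614 a and 614 b.
widths = [2048, 1025, 1025, 1024, 1024]
print(load_cycles(widths))
```

The terms are 16 + 9 + 9 + 8 + 8 = 50; the extra bit in each 1025-bit MU register costs one additional cycle.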

The MR block 210 a includes a control logic block 640 and a data selection control block 630. The control logic block 640 determines which values are produced during which processing cycle and provides control signals for one or more of the other sub-blocks. The control logic block 640 includes one or more state machines and counters that track the state of the various sub-blocks and the processing cycles.

The data selection block 630 directs data from one or more of the registers to one or more of the other sub-blocks. For example, in some embodiments, the data selection block includes several multiplexers and a multiplexer control component. The two P registers 614 a, 614 b and the two MU registers 612 a, 612 b are connected as inputs to the data selection control block 630. In addition, the MSB of text T output by the binary divide sub-block 632, shown as the output 633, is connected as an input to the data selection control block 630.

The MR block 210 a includes a multiplier block 650 and a subtracter block 670. In the illustrated embodiment the subtracter block 670 works with operands having up to 1152 bits. The subtracter block 670 includes two operand inputs 672, 674. Operand input 672 accepts values for a first operand and operand input 674 accepts values for the operand that is subtracted from the first operand. The subtracter block 670 is used to perform the subtractions during steps 570 and 582 described above with reference to FIG. 5A.

In the illustrated embodiment the multiplier block 650 works with operands having up to 1025 bits. The multiplier block 650 is used to perform the multiplications during steps 518 and 530 described above with reference to FIG. 5A. In the illustrated embodiment, the multiplier block 650 is based on a bit-serial architecture and includes a multiplier sub-block 652, a small adder 654 and a large adder 656. The architecture was chosen to reduce the overall gate count of the MR block 210 a. The multiplier sub-block 652 computes the product of two operands up to 64 bits in size. The small adder 654 computes the sum of two operands up to 128 bits in size. The large adder 656 computes the sum of two operands up to 1152 bits in size.

Outputs from the data selection control block 630 are directed to the two operands of a multiplier 650 or to the input 674 of the subtracter 670. For example, during step 518 for the first modulus P1, described above with reference to FIG. 5A, the output 633 that contains data representing the initial value of TA is directed to one operand of multiplier 650 and the data from MU register 612 a is directed to the other operand. During step 530 for the first modulus P1, the output 663 that contains data representing Q (as described below) is directed to one operand of multiplier 650 and the data from P register 614 a is directed to the other operand. In some embodiments, both TA and Q are stored in a TA register in the data selection control block 630. During step 582 for the first modulus P1, the data from P register 614 a is directed to the input 674 for the subtracted operand on the subtracter 670.

The output from the multiplier 650 goes to either a binary divide sub-block 662 or a binary mod sub-block 664, based on a control input signal provided by the control logic block 640. For example, during step 520, when the MSB of the product of MU and the contents of TA are obtained, the product is directed through binary divide sub-block 662. In the illustrated embodiment, this output from binary divide sub-block 662 is the value of Q, the final contents of TA, as indicated by the arrow 663 in FIG. 6. In some embodiments, this output is stored in the TA register of the data selection control block 630. During step 530, when the LSB of the product of P and Q are obtained, the product is directed through binary mod sub-block 664. In the illustrated embodiment, this output from binary mod sub-block 664 is the contents of TC, as indicated by the TC register 665 in FIG. 6.

During step 570, the data from the TC register 665 is directed to the subtracted input 674 of subtracter 670; and the TB output 635 from binary mod sub-block 634 is directed to the other operand, as depicted in FIG. 6. The result is the first estimate of the residue and is placed in the residue register 675. For example, the result is the first estimate of the residue CP of the cipher text C, such as C1 for the first modulus P1.

During step 580, the control logic block 640 determines whether the value in the residue register 675 is negative. If so, then the value in the residue register 675 is negated by the control logic block 640.

During step 584, the control logic block 640 determines whether the value in the residue register 675 is greater than the value in the P register 614 a or 614 b for the current modulus, P1 or P2, respectively. If so, then another subtraction is performed during step 588. This subsequent subtraction is performed by the MR block 210 a. The contents of the residue register 675 are input to the first input 672 of the subtracter 670. The contents of one of the moduli, indicated by P1/P2 output 637 from the data selection control block 630, are input to the subtracted input 674 of the subtracter 670.

Therefore, the modular reduction block 210 a is one implementation in hardware for the method 500 depicted in FIG. 5A.

7.0 Hardware Overview

FIG. 7 is a block diagram that illustrates a computer system 700 upon which an embodiment of the invention may be implemented. Computer system 700 includes a bus 702 or other communication mechanism for communicating information, and a processor 704 coupled with bus 702 for processing information. Computer system 700 also includes a main memory 706, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 702 for storing information and instructions to be executed by processor 704. Main memory 706 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 704. Computer system 700 further includes a read only memory (ROM) 708 or other static storage device coupled to bus 702 for storing static information and instructions for processor 704. A storage device 710, such as a magnetic disk or optical disk, is provided and coupled to bus 702 for storing information and instructions.

Computer system 700 may be coupled via bus 702 to a display 712, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 714, including alphanumeric and other keys, is coupled to bus 702 for communicating information and command selections to processor 704. Another type of user input device is cursor control 716, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 704 and for controlling cursor movement on display 712. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.

The invention is related to the use of computer system 700 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are performed by computer system 700 in response to processor 704 executing one or more sequences of one or more instructions contained in main memory 706. Such instructions may be read into main memory 706 from another computer-readable medium, such as storage device 710. Execution of the sequences of instructions contained in main memory 706 causes processor 704 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.

The term “computer-readable medium” as used herein refers to any medium that participates in providing instructions to processor 704 for execution. Such a medium may take many forms, including but not limited to storage media, such as non-volatile storage media or volatile storage media, and transmission media. Non-volatile storage media includes, for example, optical or magnetic disks, such as storage device 710. Volatile storage media includes dynamic memory, such as main memory 706. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 702. Transmission media can also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.

Common forms of computer-readable storage media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or memory cartridge, or any other storage medium from which a computer can read.

Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor 704 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 700 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal and appropriate circuitry can place the data on bus 702. Bus 702 carries the data to main memory 706, from which processor 704 retrieves and executes the instructions. The instructions received by main memory 706 may optionally be stored on storage device 710 either before or after execution by processor 704.

Computer system 700 also includes a communication interface 718 coupled to bus 702. Communication interface 718 provides a two-way data communication coupling to a network link 720 that is connected to a local network 722. For example, communication interface 718 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 718 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 718 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Network link 720 typically provides data communication through one or more networks to other data devices. For example, network link 720 may provide a connection through local network 722 to a host computer 724 or to data equipment operated by an Internet Service Provider (ISP) 726. ISP 726 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the “Internet” 728. Local network 722 and Internet 728 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 720 and through communication interface 718, which carry the digital data to and from computer system 700, are exemplary forms of carrier waves transporting the information.

Computer system 700 can send messages and receive data, including program code, through the network(s), network link 720 and communication interface 718. In the Internet example, a server 730 might transmit a requested code for an application program through Internet 728, ISP 726, local network 722 and communication interface 718.

The received code may be executed by processor 704 as it is received, and/or stored in storage device 710, or other non-volatile storage for later execution. In this manner, computer system 700 may obtain application code in the form of a carrier wave.

8.0 Extensions and Alternatives

In the foregoing specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

CLAIMS

1. An apparatus for cryptographic data processing, comprising: a first circuit configured to determine, based on a modulus and an integer, a first residue of the integer modulo the modulus, wherein: the modulus has a number of binary digits; and the integer represents either plaintext or ciphertext; a second circuit configured to determine, based on the modulus and the number of binary digits, a second residue of two raised to the power of twice the number of binary digits modulo the modulus; wherein at least a first portion of time during which the first circuit determines the first residue is the same as a second portion of time during which the second circuit determines the second residue; and a third circuit configured to determine, based on the first residue and the second residue, a cryptographic result.
2. An apparatus as recited in claim 1, wherein: the number of binary digits is a first number of binary digits; the modulus is a first modulus; a second modulus has a second number of binary digits; and the apparatus further comprises: a fourth circuit configured to determine, based on the second modulus and the second number of binary digits, a third residue of two raised to the power of twice the second number of binary digits modulo the second modulus; wherein at least a third portion of time during which the first circuit determines the first residue is the same as a fourth portion of time during which the fourth circuit determines the third residue; and wherein at least a fifth portion of time during which the second circuit determines the second residue is the same as a sixth portion of time during which the fourth circuit determines the third residue.
3. An apparatus as recited in claim 1, wherein: the cryptographic result is either the plaintext or the ciphertext; and the third circuit is further configured to determine the cryptographic result based on a public-key private-key pair.
4. An apparatus as recited in claim 1, wherein the third circuit is further configured to determine the cryptographic result based on an encryption algorithm selected from the group consisting of the Rivest, Shamir, and Adleman (RSA) algorithm, Barrett's algorithm, the Digital Signature Algorithm (DSA), the Diffie-Hellman algorithm, and the Ephemeral Diffie-Hellman algorithm.
5. An apparatus as recited in claim 1, wherein: the number of binary digits is a first number of binary digits; the modulus is a first modulus; a second modulus has a second number of binary digits; the first circuit is further configured to determine, based on the second modulus and the integer, a third residue of the integer modulo the second modulus; the second circuit is further configured to determine, based on the second modulus and the second number of binary digits, a fourth residue of two raised to the power of twice the second number of binary digits modulo the second modulus; wherein at least a third portion of time during which the first circuit determines the third residue is the same as a fourth portion of time during which the second circuit determines the fourth residue; and the third circuit is further configured to determine the cryptographic result based on the first residue, the second residue, the third residue, and the fourth residue.
6. An apparatus as recited in claim 5, wherein: the first modulus is a first prime number; the second modulus is a second prime number; a third modulus is equal to the product of the first prime number and the second prime number; the third modulus has a third number of binary digits; the first residue is a first modular reduction of the integer, based on the first prime number; the second residue is a second modular reduction of the integer, based on the second prime number; the third residue is a first Montgomery constant; the fourth residue is a second Montgomery constant; a fourth circuit is configured to determine, based on the third modulus and the third number of binary digits, a fifth residue of two raised to the power of twice the third number of binary digits modulo the third modulus, wherein the fifth residue is a third Montgomery constant; wherein both (a) the first portion of time during which the first circuit determines the first residue and (b) the third portion of time during which the first circuit determines the third residue are the same as a fifth portion of time during which the fourth circuit determines the fifth residue; and wherein both (c) the second portion of time during which the second circuit determines the second residue and (d) the fourth portion of time during which the second circuit determines the fourth residue are the same as a sixth portion of time during which the fourth circuit determines the fifth residue; and the third circuit is further configured to determine the cryptographic result, based on Montgomery's method, the first modular reduction of the integer, the second modular reduction of the integer, the first Montgomery constant, the second Montgomery constant, and the third Montgomery constant.
7. An apparatus as recited in claim 1, wherein the first circuit is further configured to determine the first residue of the integer modulo the modulus without performing a division by the modulus and without consuming a number of processing cycles as great as the number of binary digits.
8. An apparatus as recited in claim 1, wherein the second circuit is further configured to determine the second residue by being configured to: (a) initialize a data element with a value that represents two raised to the power of the number of binary digits; (b) determine a difference by subtracting the modulus from the value in the data element; (c) when the difference is not negative, shift the value in the data element toward more significant digits by one binary digit; (d) repeat (b) and (c) until a number of times that step (b) is repeated is equal to the number of binary digits; and wherein the second residue is equal to the value of the data element, after the number of times equals the number of binary digits.
9. An apparatus for cryptographic data processing, comprising: means for causing, based on a modulus and an integer, a first processing means to determine a first residue of the integer modulo the modulus, wherein: the modulus has a number of binary digits; and the integer represents either plaintext or ciphertext; means for causing, based on the modulus and the number of binary digits, a second processing means to determine a second residue of two raised to the power of twice the number of binary digits modulo the modulus; wherein at least a first portion of time during which the first processing means determines the first residue is the same as a second portion of time during which the second processing means determines the second residue; and means for causing, based on the first residue and the second residue, a third processing means to determine a cryptographic result.
10. An apparatus as recited in claim 9, wherein: the number of binary digits is a first number of binary digits; the modulus is a first modulus; a second modulus has a second number of binary digits; and the apparatus further comprises: means for causing, based on the second modulus and the second number of binary digits, a fourth processing means to determine a third residue of two raised to the power of twice the second number of binary digits modulo the second modulus; wherein at least a third portion of time during which the first processing means determines the first residue is the same as a fourth portion of time during which the fourth processing means determines the third residue; and wherein at least a fifth portion of time during which the second processing means determines the second residue is the same as a sixth portion of time during which the fourth processing means determines the third residue.
11. An apparatus as recited in claim 9, wherein: the cryptographic result is either the plaintext or the ciphertext; and the means for causing the third processing means to determine the cryptographic result further comprises means for causing the third processing means to determine the cryptographic result based on a public-key private-key pair.
12. An apparatus as recited in claim 9, wherein the means for causing the third processing means to determine the cryptographic result further comprises: means for causing the third processing means to determine the cryptographic result based on an encryption algorithm selected from the group consisting of the Rivest, Shamir, and Adleman (RSA) algorithm, Barrett's algorithm, the Digital Signature Algorithm (DSA), the Diffie-Hellman algorithm, and the Ephemeral Diffie-Hellman algorithm.
13. An apparatus as recited in claim 9, wherein: the number of binary digits is a first number of binary digits; the modulus is a first modulus; a second modulus has a second number of binary digits; and the apparatus further comprises: means for causing, based on the second modulus and the integer, the first processing means to determine a third residue of the integer modulo the second modulus; means for causing, based on the second modulus and the second number of binary digits, the second processing means to determine a fourth residue of two raised to the power of twice the second number of binary digits modulo the second modulus; wherein at least a third portion of time during which the first processing means determines the third residue is the same as a fourth portion of time during which the second processing means determines the fourth residue; and the means for causing the third processing means to determine the cryptographic result, based on the first residue and the second residue, further comprises: means for causing, based on the first residue, the second residue, the third residue, and the fourth residue, the third processing means to determine the cryptographic result.
14. An apparatus as recited in claim 13, wherein: the first modulus is a first prime number; the second modulus is a second prime number; a third modulus is equal to the product of the first prime number and the second prime number; the third modulus has a third number of binary digits; the first residue is a first modular reduction of the integer, based on the first prime number; the second residue is a second modular reduction of the integer, based on the second prime number; the third residue is a first Montgomery constant; the fourth residue is a second Montgomery constant; the apparatus further comprises: means for causing, based on the third modulus and the second number of binary digits, a fourth processing means to determine a fifth residue of two raised to the power of twice the third number of binary digits modulo the third modulus, wherein the fifth residue is a third Montgomery constant; wherein both (a) the first portion of time during which the first processing means determines the first residue and (b) the third portion of time during which the first processing means determines the third residue are the same as a fifth portion of time during which the fourth processing means determines the fifth residue; and wherein both (c) the second portion of time during which the second processing means determines the second residue and (d) the fourth portion of time during which the second processing means determines the fourth residue are the same as a sixth portion of time during which the fourth processing means determines the fifth residue; and the means for causing the third processing means to determine the cryptographic result further comprises means for causing the third processing means to determine the cryptographic result, based on Montgomery's method, the first modular reduction of the integer, the second modular reduction of the integer, the first Montgomery constant, and the second Montgomery constant.
15. An apparatus as recited in claim 9, wherein the means for causing the first processing means to determine the first residue of the integer modulo the modulus further comprises: means for causing the first processing means to determine the first residue of the integer modulo the modulus without performing a division by the modulus and without consuming a number of processing cycles as great as the number of binary digits.
16. An apparatus as recited in claim 9, wherein the means for causing the second processing means to determine the second residue further comprises: (a) means for initializing a data element with a value that represents two raised to the power of the number of binary digits; (b) means for determining a difference by subtracting the modulus from a value represented by the value in the data element; (c) means for shifting, when the difference is not negative, the value in the data element toward more significant digits by one binary digit; (d) means for repeating the means of (b) and (c) until a number of times that means (b) is repeated is equal to the number of binary digits; and wherein the second residue is equal to the value of the data element, after the number of times equals the number of binary digits.
17. A computer-implemented method for cryptographic data processing, comprising the steps of: based on a modulus and an integer, causing a first processing means to determine a first residue of the integer modulo the modulus, wherein: the modulus has a number of binary digits; and the integer represents either plaintext or ciphertext; based on the modulus and the number of binary digits, causing a second processing means to determine a second residue of two raised to the power of twice the number of binary digits modulo the modulus; wherein at least a first portion of time during which the first processing means determines the first residue is the same as a second portion of time during which the second processing means determines the second residue; and based on the first residue and the second residue, causing a third processing means to determine a cryptographic result.
18. A computer-implemented method as recited in claim 17, wherein: the number of binary digits is a first number of binary digits; the modulus is a first modulus; a second modulus has a second number of binary digits; and the computer-implemented method further comprises the step of: based on the second modulus and the second number of binary digits, causing a fourth processing means to determine a third residue of two raised to the power of twice the second number of binary digits modulo the second modulus; wherein at least a third portion of time during which the first processing means determines the first residue is the same as a fourth portion of time during which the fourth processing means determines the third residue; and wherein at least a fifth portion of time during which the second processing means determines the second residue is the same as a sixth portion of time during which the fourth processing means determines the third residue.
19. A computer-implemented method as recited in claim 17, wherein: the cryptographic result is either the plaintext or the ciphertext; and causing the third processing means to determine the cryptographic result further comprises the step of causing the third processing means to determine the cryptographic result based on a public-key private-key pair.
20. A computer-implemented method as recited in claim 17, wherein causing the third processing means to determine the cryptographic result further comprises the step of: causing the third processing means to determine the cryptographic result based on an encryption algorithm selected from the group consisting of the Rivest, Shamir, and Adleman (RSA) algorithm, Barrett's algorithm, the Digital Signature Algorithm (DSA), the Diffie-Hellman algorithm, and the Ephemeral Diffie-Hellman algorithm.
21. A computer-implemented method as recited in claim 17, wherein: the number of binary digits is a first number of binary digits; the modulus is a first modulus; a second modulus has a second number of binary digits; and the computer-implemented method further comprises the steps of: based on the second modulus and the integer, causing the first processing means to determine a third residue of the integer modulo the second modulus; based on the second modulus and the second number of binary digits, causing the second processing means to determine a fourth residue of two raised to the power of twice the second number of binary digits modulo the second modulus; wherein at least a third portion of time during which the first processing means determines the third residue is the same as a fourth portion of time during which the second processing means determines the fourth residue; and causing the third processing means to determine the cryptographic result, based on the first residue and the second residue, further comprises the step of: based on the first residue, the second residue, the third residue, and the fourth residue, causing the third processing means to determine the cryptographic result.
22. A computer-implemented method as recited in claim 21, wherein: the first modulus is a first prime number; the second modulus is a second prime number; a third modulus is equal to the product of the first prime number and the second prime number; the third modulus has a third number of binary digits; the first residue is a first modular reduction of the integer, based on the first prime number; the second residue is a second modular reduction of the integer, based on the second prime number; the third residue is a first Montgomery constant; the fourth residue is a second Montgomery constant; the computer-implemented method further comprises the step of: based on the third modulus and the second number of binary digits, causing a fourth processing means to determine a fifth residue of two raised to the power of twice the third number of binary digits modulo the third modulus, wherein the fifth residue is a third Montgomery constant; wherein both (a) the first portion of time during which the first processing means determines the first residue and (b) the third portion of time during which the first processing means determines the third residue are the same as a fifth portion of time during which the fourth processing means determines the fifth residue; and wherein both (c) the second portion of time during which the second processing means determines the second residue and (d) the fourth portion of time during which the second processing means determines the fourth residue are the same as a sixth portion of time during which the fourth processing means determines the fifth residue; and causing the third processing means to determine the cryptographic result further comprises causing the third processing means to determine the cryptographic result, based on Montgomery's method, the first modular reduction of the integer, the second modular reduction of the integer, the first Montgomery constant, and the second Montgomery constant.
23. A computer-implemented method as recited in claim 17, wherein causing the first processing means to determine the first residue of the integer modulo the modulus further comprises the step of: causing the first processing means to determine the first residue of the integer modulo the modulus without performing a division by the modulus and without consuming a number of processing cycles as great as the number of binary digits.
24. A computer-implemented method as recited in claim 17, wherein causing the second processing means to determine the second residue further comprises the steps of: (a) initializing a data element with a value that represents two raised to the power of the number of binary digits; (b) determining a difference by subtracting the modulus from a value represented by the value in the data element; (c) when the difference is not negative, shifting the value in the data element toward more significant digits by one binary digit; (d) repeating computer-implemented steps (b) and (c) until a number of times that step (b) is repeated is equal to the number of binary digits; and wherein the second residue is equal to the value of the data element, after the number of times equals the number of binary digits.
25. A computer-readable storage medium carrying one or more sequences of instructions for cryptographic data processing, which instructions, when executed by one or more processors, cause the one or more processors to perform the steps of: based on a modulus and an integer, causing a first processing means to determine a first residue of the integer modulo the modulus, wherein: the modulus has a number of binary digits; and the integer represents either plaintext or ciphertext; based on the modulus and the number of binary digits, causing a second processing means to determine a second residue of two raised to the power of twice the number of binary digits modulo the modulus; wherein at least a first portion of time during which the first processing means determines the first residue is the same as a second portion of time during which the second processing means determines the second residue; and based on the first residue and the second residue, causing a third processing means to determine a cryptographic result.
26. A computer-readable storage medium as recited in claim 25, wherein: the number of binary digits is a first number of binary digits; the modulus is a first modulus; a second modulus has a second number of binary digits; and the computer-readable storage medium further comprises instructions which, when executed by the one or more processors, cause the one or more processors to perform the step of: based on the second modulus and the second number of binary digits, causing a fourth processing means to determine a third residue of two raised to the power of twice the second number of binary digits modulo the second modulus; wherein at least a third portion of time during which the first processing means determines the first residue is the same as a fourth portion of time during which the fourth processing means determines the third residue; and wherein at least a fifth portion of time during which the second processing means determines the second residue is the same as a sixth portion of time during which the fourth processing means determines the third residue.
27. A computer-readable storage medium as recited in claim 25, wherein: the cryptographic result is either the plaintext or the ciphertext; and the instructions for causing the third processing means to determine the cryptographic result further comprise instructions for performing the step of causing the third processing means to determine the cryptographic result based on a public-key private-key pair.
28. A computer-readable storage medium as recited in claim 25, wherein the instructions for causing the third processing means to determine the cryptographic result further comprise instructions for performing the step of: causing the third processing means to determine the cryptographic result based on an encryption algorithm selected from the group consisting of the Rivest, Shamir, and Adleman (RSA) algorithm, Barrett's algorithm, the Digital Signature Algorithm (DSA), the Diffie-Hellman algorithm, and the Ephemeral Diffie-Hellman algorithm.
29. A computer-readable storage medium as recited in claim 25, wherein: the number of binary digits is a first number of binary digits; the modulus is a first modulus; a second modulus has a second number of binary digits; and the computer-readable storage medium further comprises instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of: based on the second modulus and the integer, causing the first processing means to determine a third residue of the integer modulo the second modulus; based on the second modulus and the second number of binary digits, causing the second processing means to determine a fourth residue of two raised to the power of twice the second number of binary digits modulo the second modulus; wherein at least a third portion of time during which the first processing means determines the third residue is the same as a fourth portion of time during which the second processing means determines the fourth residue; and the instructions for causing the third processing means to determine the cryptographic result, based on the first residue and the second residue, further comprise instructions for performing the step of: based on the first residue, the second residue, the third residue, and the fourth residue, causing the third processing means to determine the cryptographic result.
30. A computer-readable storage medium as recited in claim 29, wherein: the first modulus is a first prime number; the second modulus is a second prime number; a third modulus is equal to the product of the first prime number and the second prime number; the third modulus has a third number of binary digits; the first residue is a first modular reduction of the integer, based on the first prime number; the second residue is a second modular reduction of the integer, based on the second prime number; the third residue is a first Montgomery constant; the fourth residue is a second Montgomery constant; the computer-readable storage medium further comprises instructions which, when executed by the one or more processors, cause the one or more processors to perform the step of: based on the third modulus and the second number of binary digits, causing a fourth processing means to determine a fifth residue of two raised to the power of twice the third number of binary digits modulo the third modulus, wherein the fifth residue is a third Montgomery constant; wherein both (a) the first portion of time during which the first processing means determines the first residue and (b) the third portion of time during which the first processing means determines the third residue are the same as a fifth portion of time during which the fourth processing means determines the fifth residue; and wherein both (c) the second portion of time during which the second processing means determines the second residue and (d) the fourth portion of time during which the second processing means determines the fourth residue are the same as a sixth portion of time during which the fourth processing means determines the fifth residue; and the instructions for causing the third processing means to determine the cryptographic result further comprise instructions for performing the step of causing the third processing means to determine the cryptographic result, based on Montgomery's method, the first modular reduction of the integer, the second modular reduction of the integer, the first Montgomery constant, and the second Montgomery constant.
31. A computer-readable storage medium as recited in claim 25, wherein the instructions for causing the first processing means to determine the first residue of the integer modulo the modulus further comprise instructions for performing the step of: causing the first processing means to determine the first residue of the integer modulo the modulus without performing a division by the modulus and without consuming a number of processing cycles as great as the number of binary digits.
32. A computer-readable storage medium as recited in claim 25, wherein the instructions for causing the second processing means to determine the second residue further comprise instructions for performing the steps of: (a) initializing a data element with a value that represents two raised to the power of the number of binary digits; (b) determining a difference by subtracting the modulus from a value represented by the value in the data element; (c) when the difference is not negative, shifting the value in the data element toward more significant digits by one binary digit; (d) repeating computer-implemented steps (b) and (c) until a number of times that step (b) is repeated is equal to the number of binary digits; and wherein the second residue is equal to the value of the data element, after the number of times equals the number of binary digits.
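
For illustration only, the shift-and-subtract steps (a) through (d) recited in claims 8, 16, 24 and 32, together with the overlapping computation of the first and second residues recited in claims 1, 9, 17 and 25, might be sketched in software as follows. This is a minimal Python sketch under stated assumptions, not the claimed apparatus. The names `montgomery_constant` and `concurrent_preliminaries` are hypothetical. The sketch assumes an odd modulus, reads step (c) as keeping the difference when it is not negative before shifting, and adds one final conditional subtraction so that the result is fully reduced. Two worker threads merely stand in for the separate first and second processors, and Python's `%` operator stands in for the first processor's reduction.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def montgomery_constant(modulus: int) -> int:
    """Return 2**(2*n) mod modulus, where n is the bit length of modulus.

    Uses only subtraction, comparison and one-bit shifts (no division).
    Assumes an odd modulus of at least 3, as is usual for Montgomery's method.
    """
    if modulus < 3 or modulus % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    n = modulus.bit_length()
    value = 1 << n                  # step (a): start from 2**n
    for _ in range(n):              # step (d): repeat n times
        diff = value - modulus      # step (b): trial subtraction
        if diff >= 0:               # assumed reading: keep a non-negative difference
            value = diff
        value <<= 1                 # step (c): shift by one binary digit
    if value >= modulus:            # assumed final correction to reach the range 0..modulus-1
        value -= modulus
    return value


def concurrent_preliminaries(integer: int, modulus: int) -> Tuple[int, int]:
    """Start both preliminary computations before either one has finished."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Stand-in for the first processor: residue of the integer.
        first = pool.submit(lambda: integer % modulus)
        # Stand-in for the second processor: Montgomery constant.
        second = pool.submit(montgomery_constant, modulus)
        return first.result(), second.result()
```

Both returned values would then be handed to whatever routine plays the role of the third processor, for example a Montgomery-based modular exponentiation.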